feat: add array_normalize scalar function by crm26 · Pull Request #22013 · apache/datafusion

crm26 · 2026-05-05T00:46:14Z

Which issue does this PR close?

Part of #21536 — split of #21371 into one-function-per-PR. Third in the series after #21542 (cosine_distance) and #21861 (inner_product).

Rationale for this change

Adds array_normalize(array), which returns the L2-normalized version of a numeric input vector. Each output element is array[i] / sqrt(sum over j of array[j]^2), where the sum runs over every element of the row. Returns the same list kind as the input (List<Float64> or LargeList<Float64>).
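As a rough illustration of the formula only, the computation for one row can be sketched in plain Rust. The function name `l2_normalize` is hypothetical and is not the PR's code; the sketch assumes a non-empty row with a non-zero norm and leaves NULL, empty, and zero-vector handling aside.

```rust
/// Divide every element by the Euclidean (L2) norm of the slice.
/// Assumption: the slice is non-empty and its norm is non-zero.
fn l2_normalize(v: &[f64]) -> Vec<f64> {
    // sqrt of the sum of squares over the whole row
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    v.iter().map(|x| x / norm).collect()
}
```

For the 3-4-5 triangle case listed in the tests, the norm is 5, so `[3.0, 4.0]` maps to approximately `[0.6, 0.8]`.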

Aliased as list_normalize to match the array_X/list_X convention used across the crate.

What changes are included in this PR?

Coercion shell mirrors the merged cosine_distance/inner_product pattern:

coerce_types accepts List/LargeList/FixedSizeList of any numeric inner type, plus bare NULL. After coercion the inner function only sees List(Float64) or LargeList(Float64).
Per-row L2 norm computed inline (no shared module), using a single as_float64_array(list_array.values()) downcast plus value_offsets() slicing — no per-row downcasts.
Manual list builder: Vec<f64> for values, Vec<O> for offsets, NullBuffer for row validity.
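The "one flat values buffer plus offsets slicing" idea can be modelled without Arrow at all. The sketch below is a std-only stand-in, not the PR's code: `row_norms` is a hypothetical name, `values` plays the role of the downcast Float64 child array, and `offsets` plays the role of `value_offsets()`.

```rust
/// Compute the L2 norm of each row of a list column stored as one flat
/// values buffer plus an offsets buffer: row i spans
/// values[offsets[i]..offsets[i + 1]].
fn row_norms(values: &[f64], offsets: &[i32]) -> Vec<f64> {
    offsets
        .windows(2)
        .map(|w| {
            // Slice the shared buffer; no per-row allocation or downcast.
            let row = &values[w[0] as usize..w[1] as usize];
            row.iter().map(|x| x * x).sum::<f64>().sqrt()
        })
        .collect()
}
```

With `values = [3, 4, 1, 2, 2]` and `offsets = [0, 2, 2, 5]`, the three rows are `[3, 4]`, `[]`, and `[1, 2, 2]`.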

Per-row semantics:

NULL row → NULL output
NULL element in list → NULL row
Empty list → empty list (no division-by-zero hazard)
Zero magnitude → NULL row (consistent with cosine_distance's zero-magnitude → NULL)
Otherwise → divide each element by sqrt(sum-of-squares)
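The five rules above can be read as a single per-row function. The following is a minimal sketch under stated assumptions: `normalize_row` is a hypothetical name, the outer `Option` stands for row validity, and each inner `Option` stands for element validity. The actual implementation works on Arrow buffers rather than on `Option` values.

```rust
/// Per-row model of the semantics listed above.
fn normalize_row(row: Option<&[Option<f64>]>) -> Option<Vec<f64>> {
    // NULL row -> NULL output
    let row = row?;
    // NULL element in list -> NULL row
    let vals: Vec<f64> = row.iter().copied().collect::<Option<Vec<f64>>>()?;
    // Empty list -> empty list (never reaches the division)
    if vals.is_empty() {
        return Some(Vec::new());
    }
    let sum_sq: f64 = vals.iter().map(|x| x * x).sum();
    // Zero magnitude -> NULL row
    if sum_sq == 0.0 {
        return None;
    }
    // Otherwise divide each element by sqrt(sum-of-squares)
    let norm = sum_sq.sqrt();
    Some(vals.iter().map(|x| x / norm).collect())
}
```

Checking the empty case before the zero-magnitude case matters in this model: an empty row also has a sum of squares of zero, and would otherwise come out as NULL instead of an empty list.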

Are these changes tested?

Yes. SLT covers:

3-4-5 right triangle, 3D vector, already-unit-axis, single non-zero component, negative components
Bare NULL input, NULL element in list, zero vector, empty array
LargeList, FixedSizeList (via coercion), Float32 and Int64 inner types, integer literals
Multi-row query mixing normal / NULL row / zero-vector row / null-element row
Plan error for non-list input
No-args error
Return-type assertion (List(Float64))
list_normalize alias coverage (constant + multi-row with NULL)

Are there any user-facing changes?

New scalar function array_normalize (alias list_normalize), documented in docs/source/user-guide/sql/scalar_functions.md.

Jefffrey · 2026-05-05T02:01:21Z

+    let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
+    let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);
+    new_offsets.push(O::usize_as(0));
+    let mut validity: Vec<bool> = Vec::with_capacity(list_array.len());


Use NullBufferBuilder here instead. One benefit is that finishing it may output None if there are no nulls (currently we always provide a null buffer, even if there are no nulls).

Swapped to NullBufferBuilder — append_null() / append_non_null() per row, nulls.finish() returns None when no nulls accumulated, so we stop emitting a redundant null buffer on all-valid inputs. Thanks.
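The behaviour discussed here, a validity buffer that is only emitted when at least one null was appended, can be modelled with a small std-only type. `ValidityBuilder` below is a hypothetical stand-in written for illustration; it borrows the method names mentioned in the reply but is not Arrow's `NullBufferBuilder` and does not reproduce its memory layout.

```rust
/// Std-only model of "emit a validity buffer only if a null was seen".
struct ValidityBuilder {
    bits: Vec<bool>,
    has_null: bool,
}

impl ValidityBuilder {
    fn new() -> Self {
        Self { bits: Vec::new(), has_null: false }
    }

    /// Record a valid row.
    fn append_non_null(&mut self) {
        self.bits.push(true);
    }

    /// Record a NULL row and remember that a buffer is now required.
    fn append_null(&mut self) {
        self.bits.push(false);
        self.has_null = true;
    }

    /// `None` means every row is valid, so no buffer needs to be attached.
    fn finish(self) -> Option<Vec<bool>> {
        if self.has_null { Some(self.bits) } else { None }
    }
}
```

In this model an all-valid input finishes to `None`, which corresponds to the "no redundant null buffer" outcome described in the reply.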

Jefffrey · 2026-05-05T02:03:12Z

+    let offsets = list_array.value_offsets();
+
+    let mut new_values: Vec<f64> = Vec::with_capacity(values.len());
+    let mut new_offsets: Vec<O> = Vec::with_capacity(list_array.len() + 1);


I think it might be simpler to use OffsetBufferBuilder here

Thanks @Jefffrey — swapped to OffsetBufferBuilder in c3576a30e. Each branch now uses push_length(0) for null/null-element/empty/zero-mag rows and push_length(len) for valid rows; final buffer from new_offsets.finish(). Cleaner than the manual Vec<O> + OffsetBuffer::new(... .into()).
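To show why pushing lengths is simpler than maintaining offsets by hand, here is a std-only model of the pattern. `OffsetsBuilder` and `output_offsets` are hypothetical names for this sketch and are not Arrow's `OffsetBufferBuilder`; `None` in the input stands for any row that emits no values (NULL row, NULL element, or zero magnitude).

```rust
/// Std-only model of an offsets builder driven by row lengths.
struct OffsetsBuilder {
    offsets: Vec<i32>,
}

impl OffsetsBuilder {
    fn new(capacity: usize) -> Self {
        let mut offsets = Vec::with_capacity(capacity + 1);
        offsets.push(0); // offsets always start at 0
        Self { offsets }
    }

    /// Append a row of `len` values: the next offset is the previous one plus `len`.
    fn push_length(&mut self, len: usize) {
        let last = *self.offsets.last().expect("offsets always holds the leading 0");
        self.offsets.push(last + len as i32);
    }

    fn finish(self) -> Vec<i32> {
        self.offsets
    }
}

/// `Some(n)` is a valid row with n output values; `None` is a row that
/// contributes no values and therefore pushes a length of 0.
fn output_offsets(rows: &[Option<usize>]) -> Vec<i32> {
    let mut builder = OffsetsBuilder::new(rows.len());
    for row in rows {
        builder.push_length(row.unwrap_or(0));
    }
    builder.finish()
}
```

Rows that push a length of 0 repeat the previous offset, so the values buffer stays densely packed regardless of how many rows are NULL.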

Jefffrey · 2026-05-20T02:02:40Z

Thanks @crm26

Adds `array_add(array1, array2)` returning the element-wise sum of two numeric arrays. Aliased as `list_add`. Follows the per-function split pattern established by cosine_distance (apache#21542), inner_product (apache#21861), and array_normalize (apache#22013) per tracking issue apache#21536.

Semantics:

- NULL row in either input -> NULL row out
- NULL element at position i in either input -> NULL element at i out (per-element propagation, divergent from inner_product which nulls the whole row; chosen because output is a list, not a scalar)
- Length mismatch between rows -> exec_err
- Empty arrays -> empty array

Supports List, LargeList, and FixedSizeList inputs; numeric element types are coerced to Float64. If any input is LargeList, both sides are widened to LargeList for homogeneous runtime dispatch. Uses OffsetBufferBuilder + NullBufferBuilder per the pattern adopted in array_normalize round 1.
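The `array_add` semantics described in this commit message can be sketched per row as follows. This is a hypothetical model, not code from that PR: `add_rows` is an invented name, a `String` error stands in for `exec_err`, and the choice to check for NULL rows before checking lengths is an assumption.

```rust
/// Per-row model: outer Option is row validity, inner Option is element validity.
fn add_rows(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<Vec<Option<f64>>>, String> {
    // NULL row in either input -> NULL row out
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    // Length mismatch between rows -> error
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    // NULL element at position i in either input -> NULL element at i out
    let sum = a
        .iter()
        .zip(b.iter())
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Ok(Some(sum))
}
```

Unlike the `array_normalize` model, a NULL element here only nulls its own position, which is the per-element propagation the message calls out.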

@alamb

## Which issue does this PR close?

Partial of apache#21536: `array_scale` (the list+scalar arithmetic function in the vector math series).

## Rationale for this change

Continues the per-function split requested by @alamb on apache#21536. Three sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product` (apache#21861), `array_normalize` (apache#22013). `array_add` is in flight as apache#22459 by @SubhamSinghal.

Adds element-wise scalar multiplication for numeric arrays, returning a list of the same shape. Aliased as `list_scale` to match the `array_X` / `list_X` precedent in this crate.

## What changes are included in this PR?

- New scalar UDF `array_scale(array, scalar)` in `datafusion/functions-nested/src/array_scale.rs`
- Module wire-up + registration in `datafusion/functions-nested/src/lib.rs`
- SLT tests at `datafusion/sqllogictest/test_files/array_scale.slt`
- Auto-generated function docs entry in `docs/source/user-guide/sql/scalar_functions.md`

**Signature:** first arg `List/LargeList/FixedSizeList<numeric>`, second arg numeric scalar. Both coerce to `Float64`. Same list-widening rules as the binary-op siblings.

**NULL semantics:**

- NULL row in array → NULL row out
- NULL scalar → NULL row out (whole-row, because the scalar applies uniformly)
- NULL element at position `i` → NULL element at `i` out (per-element propagation)
- Empty array → empty array

**Builders:** uses `OffsetBufferBuilder` + `NullBufferBuilder` per the pattern adopted in the round-1 review of apache#22013.

## Are these changes tested?

Yes. `array_scale.slt` covers:

- Happy paths (positive, negative, zero, fractional, single-element)
- NULL propagation at all three levels (NULL row, NULL scalar, NULL element)
- All list type variants (`List`, `LargeList`, `FixedSizeList`)
- Numeric inner type coercion (Float32, Int64, integer literals)
- Multi-row queries with both constant-scalar broadcast and per-row column scalar
- Error paths (non-numeric scalar, non-list first arg, wrong arity)
- Empty array
- `list_scale` alias

## Are there any user-facing changes?

Yes: new SQL scalar function `array_scale(array, scalar)` and its alias `list_scale`. Documented in `docs/source/user-guide/sql/scalar_functions.md`.
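The three NULL levels described for `array_scale` can be modelled per row in a few lines. As with the other sketches here, `scale_row` is a hypothetical name and the `Option`-based representation is an assumption made for illustration, not the Arrow-based code in that PR.

```rust
/// Per-row model: NULL row or NULL scalar nulls the whole row,
/// a NULL element only nulls its own position.
fn scale_row(row: Option<&[Option<f64>]>, scalar: Option<f64>) -> Option<Vec<Option<f64>>> {
    let row = row?; // NULL row in array -> NULL row out
    let k = scalar?; // NULL scalar -> NULL row out
    // NULL element stays NULL; valid elements are multiplied by the scalar.
    Some(row.iter().map(|x| x.map(|v| v * k)).collect())
}
```

An empty row falls through the final line unchanged and yields an empty output row.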

Adds `array_sum(array)` returning the sum of elements in a numeric array. Aliased as `list_sum`. Part of the per-function split sequence on tracking issue apache#21536, following the pattern of the already-merged PRs in this series (cosine_distance apache#21542, inner_product apache#21861, array_normalize apache#22013, array_scale apache#22466).

Semantics:

- NULL row in array -> NULL row out
- NULL elements are skipped (SQL aggregate convention; matches PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose every element is NULL yields NULL.
- Empty array -> 0.0 (additive identity, matches SQL SUM over no rows conceptually, and DuckDB list_sum([]) = 0)

Input is List/LargeList/FixedSizeList of any numeric type; elements are coerced to Float64. Output is Float64.
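The `array_sum` rules in this message distinguish an empty row from a row whose elements are all NULL. A minimal per-row sketch, using the hypothetical name `sum_row` and the same `Option`-based model as an assumption, could look like this:

```rust
/// Per-row model: skip NULL elements, but report NULL if nothing was summed
/// from a non-empty row.
fn sum_row(row: Option<&[Option<f64>]>) -> Option<f64> {
    let row = row?; // NULL row -> NULL out
    if row.is_empty() {
        return Some(0.0); // empty array -> additive identity
    }
    let mut total = 0.0;
    let mut seen_value = false;
    // `flatten` skips the NULL (None) elements.
    for v in row.iter().flatten() {
        total += v;
        seen_value = true;
    }
    // A row whose every element is NULL yields NULL.
    if seen_value { Some(total) } else { None }
}
```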

github-actions Bot added the labels documentation (Improvements or additions to documentation), sqllogictest (SQL Logic Tests (.slt)), and functions (Changes to functions implementation) May 5, 2026

Jefffrey reviewed May 5, 2026


feat: add array_normalize scalar function

c3576a3

crm26 force-pushed the feat/array-normalize branch from 557d221 to c3576a3 Compare May 16, 2026 19:11

Jefffrey approved these changes May 20, 2026


Jefffrey added this pull request to the merge queue May 20, 2026

Merged via the queue into apache:main with commit 821260f May 20, 2026
36 checks passed

This was referenced May 20, 2026

feat: add vector distance and array math functions #21371

Closed

Add vector distance, array math, and array aggregate functions #21536

Open

crm26 mentioned this pull request May 22, 2026

feat: add array_scale scalar function #22466

Merged

crm26 mentioned this pull request May 26, 2026

feat: add array_sum scalar function #22542

Open

Jefffrey merged 1 commit into apache:main from crm26:feat/array-normalize
